Abstract

It is argued that, contrary to common wisdom, unbiasedness is not always a well grounded requirement. 
It is shown that in many cases, for a given unbiased estimator there is a simply derived biased estimator 
which gives results closer to the true value. 



1 Introduction and reminder 

For the purpose of devising the most efficient way of exploiting them, the results of physical experiments are generally regarded as realisations of random variables; statistical theory is then invoked to indicate efficient ways of using these sets of data for obtaining the values of physical parameters which are functionally tied to the parameters of the probability distributions. In this framework, one calls estimators certain random variables built on samples of potential observations which are used to evaluate part or all of the parameters of the underlying ('parent') probability distribution. The sample average is amongst the simplest examples: if the expectation value $m$ of the parent distribution is unknown, the arithmetic mean $\bar{X} = \frac{1}{n}\sum_i X_i$ is the natural and, in many cases, the 'best' estimator that can be found to evaluate $m$. The word 'best' has been quoted in the preceding sentence for reasons that will soon become clearer.

In many cases, the sample is (rightly) assumed to be made of independent observations and, as we shall assume in the sequel, the underlying distribution is supposed to have moments up to second order: there exists an expectation value $E[X] = m$ and a variance $V[X] = E[(X - m)^2] = \mu_2$. To define our notation, we call $A_n$ the estimator built on a sample of size $n$ and $a$ the parameter to be estimated, but we shall freely drop the subscript when it is irrelevant. We also assume that, as for $X$, the first two moments of $A$ exist.



1.1 Estimator properties 

Since the estimation, i.e. the value taken by the estimator after sampling, will later be used in place of the true value $a$, the one desirable property that $A$ should possess can vaguely be expressed by demanding it to take values as close as possible to $a$. How closeness is to be measured is the main issue in what follows.



Because it is easier to think in terms of fixed values rather than to keep in mind the full complexity of a probability distribution, one of the first ideas that comes to mind to satisfy this closeness requirement is to look for an estimator the expectation value of which is equal to the unknown parameter: $E[A_n] = a$. Such an estimator is said to be unbiased. When the bias (i.e. the difference $b_n = E[A_n] - a$) is not zero, it often happens that it tends to zero when the sample size grows without limit, in which case $A_n$ is said to be asymptotically unbiased.

At this point, it is important to stress that the only biases considered in the sequel are statistical biases, due to mathematical properties of the estimators. The measurements which are the source of the data can be affected by systematic biases for instrumental reasons, a simple example being that of a counter which misses part of the 'hits', thereby furnishing a systematically low count. We assume that this kind of bias is being taken care of by appropriate means and we only address the question of the statistical biases in this article.

Therefore, over and above unbiasedness, the first quality which is demanded for an estimator is consistency: a consistent estimator must somehow approach the value to be estimated when $n$ goes to infinity. The precise meaning of the word 'approach' in the previous sentence can vary according to the kind of stochastic convergence which is adopted, but it is usually understood to refer to convergence in probability, that is: for any given $\varepsilon > 0$, the probability that $A_n$ deviates from $a$ by more than $\varepsilon$ has zero limit when $n \to \infty$. More formally:

$$\forall \eta > 0 \;\; \exists N : \forall n > N \quad P(|A_n - a| > \varepsilon) < \eta$$

If $E[A_n]$ has a limit when $n \to \infty$, one does not see how that limit could differ from $a$ if the previous requirement is fulfilled, but the author knows of no proof of this without additional assumptions.

The rationale for demanding consistency is fairly clear: it is 'obvious' (but can be false !) that averaging a large number of measurements of the same quantity will yield a better estimate of that quantity; on the other hand, accumulating data would be of no use if the estimation were not getting closer to the sought-for value when the amount of data grows.

It is often written (see e.g. 0) that asymptotic properties such as consistency have nothing to say for finite sample sizes, contrary to unbiasedness which is a property defined for finite (read: 'realistic') sets of observations. We think statements like this, supposedly based on good old common sense, are very deceiving; indeed, the observed average never equals its expectation value, which is also, in a sense, an asymptotic property. All that can be said is that the average of an unlimited number of realisations of $A$ converges to $a$ in some way (the so-called law of large numbers, more on this below). But what is the relevance of all that for a single shot estimation built from a finite sample, especially if $A$ doesn't have a small dispersion ? Concentration is therefore another important quality and one also demands the estimator to have a 'small' root mean square deviation, that is, $\sqrt{V[A_n]}$ should not be larger than the error one is ready to tolerate on $a$.
Building estimators with variances going to zero in the infinite sample limit is often possible in simple problems. Ideally, an estimator which is both asymptotically unbiased and of zero asymptotic variance is all that is required, were data available in arbitrarily large amounts: one easily shows that such an estimator is consistent by using Huygens' theorem:

$$E[(A_n - a)^2] = V[A_n] + (E[A_n] - a)^2$$

and Chebyshev's inequality:

$$P(|A_n - a| > \varepsilon) \le \frac{1}{\varepsilon^2} E[(A_n - a)^2]$$

By the same token, one sees that if consistency is understood as convergence in quadratic mean, it is completely equivalent to the conjunction of the two asymptotic requirements just stated.
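
Chaining the two relations makes the argument explicit. The following is only a spelled-out version of the step described above, using the bias $b_n = E[A_n] - a$ defined earlier:

```latex
P(|A_n - a| > \varepsilon)
  \le \frac{1}{\varepsilon^2}\, E[(A_n - a)^2]
  = \frac{V[A_n] + b_n^2}{\varepsilon^2}
  \xrightarrow[n \to \infty]{} 0
  \quad \text{if } V[A_n] \to 0 \text{ and } b_n \to 0 .
```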

All the insistence on expectation values and sample means comes from the above mentioned law of large numbers, of which there exist weak and strong varieties. For what concerns us, they both say that the arithmetic mean of $n$ identically distributed independent random variables converges in probability (weak law of l.n.) or almost surely (strong law of l.n.) towards their common expectation value when $n \to \infty$, as soon as this expectation value exists (analytically speaking); this explains why unbiasedness is expressed in terms of expectation values (but see note 3) and in simple cases, estimators are indeed averages of this kind for which the law applies.

However, although people can gather only finite samples, they tend to believe that their estimators will be 'better' if they are already unbiased for finite sample sizes; they often make big calculational efforts to reach this aim, and spoil their estimators. This is the belief and the practice that we challenge in the following.



2 Why is unbiasing not necessarily a good idea ?

2.1 Smaller variance or smaller bias ? 

There is a kind of trade-off between the two requirements of low bias and low variance in certain cases. Let us assume that $A_n$ is multiplicatively biased; by this we mean that $E[A_n] = fa$ where $f$ is some positive number $\neq 1$ which may be a function of $n$.
If $f$ is known, many practitioners of statistics will rather use $A'_n = A_n/f$. However, the variance of $A'_n$ is $V[A'_n] = V[A_n]/f^2$ and if $f < 1$ one gets unbiasedness at the price of a larger dispersion and there is no reason to believe that $A'$ is better than $A$ only because its expectation value equals $a$. Thinking so is somehow forgetting that a random variable is not its expectation value and unconsciously referring to the law of large numbers, which has, however, nothing to say about the relevance of an asymptotic property for a finite sample.

2.2 What is closer ? 

Proximity will be dealt with in terms of distance, or difference. Definitions can vary and if the expected difference between $A'$ and $a$ is indeed zero by construction, the real life difference between the values taken by $A'$ and $a$ is never zero. Therefore it is more realistic to measure their distance by the mean absolute difference or the (root) mean square difference which is easier to handle, that is $D^2(A', a) = E[(A' - a)^2]$ which is simply $V[A']$ for unbiased $A'$.
As for $A$, Huygens' theorem says that: $D^2(A, a) = V[A] + (E[A] - a)^2$. The variance is the minimum of the mean square distance about a fixed point, and this minimum is reached for the fixed point taken at the expectation value. Therefore, if $A$ were additively biased, subtracting off the bias would be the right thing to do. But this is not what we have in mind here.
For $A'$, the squared distance to $a$ is $V[A'] = V[A]/f^2$.

For $A$, it is $V[A] + a^2(1 - f)^2$.

The latter can be smaller than the former for $f < 1$ and we shall base on this remark a general prescription for improving estimators, but before so doing, let us examine a specific and well known example.
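
The comparison of the two squared distances can be sketched in a few lines of Python. This is only an illustration: the function names and the numerical values are assumptions made for the example and do not come from the text.

```python
def squared_distances(var_a, a, f):
    """Return (D2 of the biased A, D2 of the unbiased A' = A / f).

    var_a is V[A], a is the true parameter and f the multiplicative
    bias factor, i.e. E[A] = f * a.
    """
    d2_biased = var_a + a**2 * (1.0 - f) ** 2   # V[A] + a^2 (1 - f)^2
    d2_unbiased = var_a / f**2                  # V[A'] = V[A] / f^2
    return d2_biased, d2_unbiased


def biased_is_closer(var_a, a, f):
    """True when the biased estimator has the smaller mean square distance."""
    d2_biased, d2_unbiased = squared_distances(var_a, a, f)
    return d2_biased < d2_unbiased


# Illustrative numbers only (hypothetical, not taken from the article).
print(squared_distances(var_a=0.5, a=1.0, f=0.8))
print(biased_is_closer(var_a=0.5, a=1.0, f=0.8))
```

With a large variance relative to $a^2$ the biased estimator wins; when the variance is tiny compared with the squared bias, removing the bias is the better choice, which is why the trade-off has to be examined case by case.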

2.3 A simple example 

When it is required to estimate the variance of a distribution, the mean of which is unknown, an 'obvious' estimator is the sample variance: $S^2 = \frac{1}{n}\sum_i (X_i - \bar{X})^2$. However, this $S^2$ is biased: $E[S^2] = \frac{n-1}{n}\mu_2$, which is precisely the kind of situation that we are considering here. More often than not, people replace $S^2$ by $S'^2 = \frac{1}{n-1}\sum_i (X_i - \bar{X})^2 = \frac{n}{n-1}S^2$ which has obviously a larger dispersion.

To study the case further, let's make things simple and assume that the parent (sample) distribution is gaussian. (The case of an arbitrary distribution is treated below.)

$S^2$ is then the estimator of $\mu_2$ given by the maximum likelihood method when $m$ is unknown, but again, most people shift to $S'^2$ because of the bias. However, it is particularly simple to show that one increases the dispersion of the estimator about $\mu_2$ by using this recipe. Indeed, it is well known that $Q = \frac{nS^2}{\mu_2}$ is $\chi^2$-distributed with $n - 1$ degrees of freedom. Therefore $E[Q] = n - 1$, $V[Q] = 2(n - 1)$ and one has $E[S^2] = \frac{n-1}{n}\mu_2$ as it has to be, $V[S^2] = \frac{2(n-1)}{n^2}\mu_2^2$ and $V[S'^2] = \frac{2}{n-1}\mu_2^2$.

But this entails that

$$D^2(S^2, \mu_2) = V[S^2] + \mu_2^2\left(\frac{1}{n}\right)^2 = \frac{2n-1}{n^2}\mu_2^2 < V[S'^2]$$

$S^2$ is therefore less dispersed about $\mu_2$ than $S'^2$ and it makes little sense to prefer the latter on the grounds that it is unbiased. We can only disagree with, e.g. 0 who compare the bias with the loss in precision calculated as the difference between the standard deviations of the two estimates and settle the matter by claiming that 'for large $n$ this loss is very much smaller than the bias'. These are things that cannot be compared. Of course, our mean square distance criterion makes use of expectation values just as the no bias criterion, but a small $D$ is much more meaningful than a zero expectation value; since all contributions are positive, they all add up in the calculation of $D^2$ and the true distance squared, in any given experiment, cannot be much larger than $D^2$ with any sizeable probability, whereas demanding no bias guarantees nothing of the kind since it can be achieved by compensation of large opposite sign contributions.
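
The comparison between the biased and the unbiased variance estimators for a gaussian parent can be checked by simulation. The sketch below is a hedged illustration, not part of the original argument: the sample size, the number of trials and the seed are arbitrary assumptions, and the function name is hypothetical.

```python
import random


def mse_of_variance_estimators(n=10, mu2=1.0, trials=20000, seed=1):
    """Monte Carlo estimate of the mean square distance to mu2 of
    S^2 (divisor n) and S'^2 (divisor n - 1) for gaussian samples."""
    rng = random.Random(seed)
    sigma = mu2 ** 0.5
    err_biased = 0.0
    err_unbiased = 0.0
    for _ in range(trials):
        xs = [rng.gauss(0.0, sigma) for _ in range(n)]
        mean = sum(xs) / n
        ss = sum((x - mean) ** 2 for x in xs)      # sum of squared residuals
        err_biased += (ss / n - mu2) ** 2          # S^2
        err_unbiased += (ss / (n - 1) - mu2) ** 2  # S'^2
    return err_biased / trials, err_unbiased / trials


d2_biased, d2_unbiased = mse_of_variance_estimators()
print(d2_biased, (2 * 10 - 1) / 10**2)   # simulation vs (2n - 1) / n^2
print(d2_unbiased, 2 / (10 - 1))         # simulation vs 2 / (n - 1)
```

Both estimators are evaluated on the same simulated samples, so the ordering of the two mean square distances is far less sensitive to statistical fluctuations than their individual values.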

One can derive limitations on the probability of an absolute difference from bounds on the variance or on the expected absolute difference, as exemplified by Chebyshev's and Kolmogorov's inequalities. But nobody will ever succeed in deriving such a bound from a bound on the bias. To put it otherwise, the absolute value of the integral of a function has much less to say about the size of that function than the integral of its absolute value.

Clearly, using $S^2$ will lead to average estimations below the true value of $\mu_2$ in the long run and the histogram built with many realisations of the Monte Carlo will not be 'centered' on the input value; many people would not like using $S^2$ precisely for that reason. We think the right answer is a flat: 'So, what ?' The real question is: what are those estimates supposed to be used for ? If it is not to show colleagues how well you do in reconstructing the input parameters of your Monte Carlo, then such things as those histograms should not be considered as the primary criterion in assessing the quality of your estimators. People are taught and used to look at those features, but a minute of thought suffices to convince oneself that a centered histogram proves very little. Control histograms can be plotted with unbiased estimators to show that 'everything is understood', but that doesn't validate the estimators for whatever subsequent use is made of the estimates.

On the contrary, every student knows that, except for linear mappings, the expectation value of the transform is not the transform of the expectation value. Therefore, there is no real reason to insist on rigorously unbiased estimations. The perfectly legitimate requirement of being as close as possible to the true value is often contradictory with the 'no-bias' criterion.
To give yet another example: nobody would say that the mean distance to the origin in, e.g., a one dimensional, symmetrical random walk is zero on the grounds that the expectation value of the random walk is zero for even $n_{step}$. The root mean square is the universally accepted measure of distance, hence the $\sqrt{n_{step}}$ rule.
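
The random walk remark can be made concrete with an exact computation on the binomial distribution of the end point. This is a small sketch under the assumption of independent steps of size ±1 with equal probabilities; the function name is hypothetical.

```python
from math import comb, sqrt


def walk_moments(n_step):
    """Exact mean, mean absolute value and r.m.s. of the end point
    of a symmetric random walk made of n_step steps of size +1 or -1."""
    mean = 0.0
    mean_abs = 0.0
    mean_sq = 0.0
    for k in range(n_step + 1):              # k = number of +1 steps
        p = comb(n_step, k) / 2**n_step      # binomial probability of k
        x = 2 * k - n_step                   # corresponding end point
        mean += p * x
        mean_abs += p * abs(x)
        mean_sq += p * x * x
    return mean, mean_abs, sqrt(mean_sq)


print(walk_moments(100))
```

The signed mean vanishes by symmetry while the root mean square equals $\sqrt{n_{step}}$, which illustrates how a zero expectation value can coexist with a sizeable typical distance.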



3 If unbias doesn't help, what about... overbias ?

3.1 Optimal bias 

Having thus set foot in the marshes of heresy, going forward is the only logical attitude. If $S^2$, above, is better than $S'^2$, what about $\frac{n}{n+k}S^2$ ?

Finding the optimum value of $k$ can be made by direct comparison: let $S''^2$ stand for the latter estimator. Then

$$\frac{V[S'^2] - D^2(S''^2, \mu_2)}{\mu_2^2} = \frac{2}{n-1}\left(1 - \left(\frac{n-1}{n+k}\right)^2\right) - \left(\frac{k+1}{n+k}\right)^2 = \frac{(k+1)(3n + 3k - nk - 1)}{(n-1)(n+k)^2}$$

The largest difference obtains for $k = 1$ and is equal to $\frac{4}{n^2-1}\mu_2^2$. The most 'concentrated' estimator about $\mu_2$ is therefore $S''^2 = \frac{1}{n+1}\sum_i (X_i - \bar{X})^2$.
This result is but a particular case of a more general formula that will be derived below.
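
The optimum can be verified with exact rational arithmetic. The sketch below assumes a gaussian parent and works in units where $\mu_2 = 1$; the function name and the scanned range of $k$ are illustrative choices.

```python
from fractions import Fraction


def d2_over_mu2sq(n, k):
    """D^2 / mu_2^2 of the estimator sum_i (X_i - Xbar)^2 / (n + k).

    Variance term 2(n - 1) / (n + k)^2 plus squared bias ((k + 1) / (n + k))^2.
    k = -1 gives the unbiased S'^2 and k = 0 the maximum likelihood S^2.
    """
    return Fraction(2 * (n - 1) + (k + 1) ** 2, (n + k) ** 2)


n = 10
best_k = min(range(-1, 6), key=lambda k: d2_over_mu2sq(n, k))
print(best_k, d2_over_mu2sq(n, best_k))                 # prints: 1 2/11
print(d2_over_mu2sq(n, -1) - d2_over_mu2sq(n, best_k))  # gain 4 / (n^2 - 1) = 4/99
```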



3.2 A word about error compensation 

Since the most 'concentrated' estimator of $\mu_2$ is $S''^2 = \frac{1}{n+1}\sum_i (X_i - \bar{X})^2$ and since this is probably not unknown, one might ask why people keep on using the unbiased $S'^2$ instead. Besides the already alluded to histograms, the unconscious idea underlying the use of unbiased estimates is probably that fluctuations above and below the 'true value' (which is the expectation value of $S'^2$ in this case) should more or less compensate. We have already remarked that such a motivation is poorly grounded for a one-time estimation. But for the sake of the argument, let us take the idea seriously. The best estimator in that case should be such that its probability to be above the 'true value' is equal to its probability to be below this value. In other words, for our example, $\mu_2$ should be the median of the distribution of the estimator rather than its mean. Let's therefore define an 'ideal' $S^2_{id}$ proportional to $S'^2$ such that $\mu_2$ be the median of its distribution. Let $S^2_{id} = \alpha S'^2$. Then $\frac{(n-1) S^2_{id}}{\alpha \mu_2}$ is $\chi^2$ distributed with $n - 1$ d.o.f. and the condition we impose calls for finding the median $M_{n-1}$ of the $\chi^2_{n-1}$ distribution. Numerical evaluation up to $n = 400$ shows that the median of $\chi^2_n$ is always between $n$ and $n - 1$, slowly decreasing and seemingly converging towards $n - 2/3$ but this value is, of course, only a guess.
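
The numerical evaluation of the median mentioned above can be reproduced along the following lines. This is a minimal sketch, not the author's procedure: it assumes the standard power series of the regularised lower incomplete gamma function and a plain bisection, and the function names are hypothetical.

```python
import math


def chi2_cdf(x, nu):
    """CDF of a chi-square law with nu degrees of freedom.

    Uses the power series of the regularised lower incomplete gamma
    function P(a, z) with a = nu / 2 and z = x / 2.
    """
    a, z = nu / 2.0, x / 2.0
    if z <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    k = 1
    while term > 1e-17 * total:       # stop when further terms are negligible
        term *= z / (a + k)
        total += term
        k += 1
    return total * math.exp(a * math.log(z) - z - math.lgamma(a))


def chi2_median(nu, tol=1e-10):
    """Median of the chi-square law, by bisection between nu - 1 and nu."""
    lo, hi = max(nu - 1.0, 0.0), float(nu)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if chi2_cdf(mid, nu) < 0.5:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


for nu in (1, 2, 10, 100, 400):
    print(nu, nu - chi2_median(nu))   # offset of the median below nu
```

The printed offsets stay below 1 and drift towards a value close to 2/3 as the number of degrees of freedom grows, in line with the guess quoted in the text.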



This means that $\alpha = \frac{n-1}{M_{n-1}} > 1$ and therefore that $S^2_{id}$ is not below but above $S'^2$, contrary to the conclusion to which we were led by our distance argument. One has $S^2_{id} = \frac{1}{n-r}\sum_i (X_i - \bar{X})^2$ with $r \approx 5/3$ and the (mean squared) distance between $S^2_{id}$ and $\mu_2$ is $V[S^2_{id}] + \left(\frac{r-1}{n-r}\right)^2\mu_2^2$, larger than everything found so far.

Facing this distressing result, one might think of a last way out for the case at hand: in the same spirit as our tentative use of the median and in line with the philosophy of the maximum likelihood method, one could assume that the best estimate of $\mu_2$ is that which renders the value found for $S'^2$ most likely. Contrary to the median case, it is quite easy to find by differentiation that the most likely value (the so-called 'mode') of a $\chi^2_n$ distribution is $n - 2$. Therefore the maximum likelihood estimator of $\mu_2$ in this sense should be taken as $\frac{n-1}{n-3}S'^2 = \frac{1}{n-3}\sum_i (X_i - \bar{X})^2$, still farther away from $S^2$ than the preceding estimate (recall that $S^2$ is the maximum likelihood estimate for a gaussian sample with unknown mean). The maximum likelihood method seems therefore to suffer from some kind of schizophrenia: the maximum likelihood estimator of $\mu_2$ based on the full sample distribution, that is $S^2$, is not the same as the maximum likelihood estimator of the same parameter based on the distribution of this same $S^2$, which is $\frac{n}{n-3}S^2$.



4 A general prescription 

4.1 Improving an unbiased estimator 

The lesson of the latter section is that 'compensation' arguments lead only to contradiction. Even the time honored maximum likelihood method is shown to be self-inconsistent. Aware of this fact, some people use M.L. only as a starting point to find some estimator which they further 'improve'. But as already remarked, the supposed improvement can spoil the result and this is particularly clear on the example that we have used.

On the other hand, the minimum squared distance criterion is certainly better grounded than the no-bias prescription for reasons which have already been explained. One then might ask for a general rule based on it.

Using the notations of paragraph 2.2, the squared distance of $A$ to $a$ can be written:

$$D^2(A, a) = V[A] + (E[A] - a)^2 = f^2 V[A'] + a^2 (f - 1)^2$$

Therefore, having found an unbiased estimator $A'$ one can try to derive a smaller distance estimator by minimizing the above expression w.r.t. $f$. Zeroing the derivative yields the condition:

$$f_m = \frac{a^2}{E[A'^2]}$$

where use has been made of $V[A'] = E[A'^2] - a^2$. Since $a^2 = E[A']^2 < E[A'^2]$, $f_m < 1$ as expected. Starting from any unbiased $A'$, it is always possible to build, in principle, an improved estimator:

$$A_{better} = \frac{a^2}{E[A'^2]} A'$$

which will be closer to the unknown parameter than $A'$. It is, of course, biased, but its squared distance to $a$ is easily seen to be reduced by a factor of $f_m$.
Note that, at least for 'mean square' consistency, $f_m \to 1$ when $n \to \infty$ because convergence in the mean square entails convergence of the first two moments of the distribution towards those of a constant; in particular, $E[A'^2] \to a^2$, so that as soon as they are built from consistent estimators, our biased estimators are themselves consistent.
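
For completeness, the minimisation and the reduction factor can be spelled out as follows. This only expands the steps stated in the text, with the biased estimator written as $fA'$:

```latex
\begin{align*}
\frac{d}{df} D^2(fA', a) &= 2 f\, V[A'] + 2 a^2 (f - 1) = 0
  \;\Longrightarrow\;
  f_m = \frac{a^2}{V[A'] + a^2} = \frac{a^2}{E[A'^2]}, \\
D^2(f_m A', a) &= f_m^2\, V[A'] + a^2 (1 - f_m)^2
  = \frac{a^2\, V[A']}{V[A'] + a^2}
  = f_m\, V[A'] .
\end{align*}
```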

The expression found for $A_{better}$ seems to depend on the value to be estimated. However, although there exist indeed cases in which it is of no use, we'll see presently that there are some important simple problems where it is perfectly usable.

Moreover, even if the exact value is not known, any non trivial upper bound on $f_m$ (that is, smaller than 1) will yield some improvement if used in place of $f_m$ to bias $A'$.

It is important to observe here that the bias so introduced always tends to zero when $n \to \infty$ as soon as the (unbiased) estimator variance goes to zero. Indeed, $f_m = \frac{a^2}{V[A'] + a^2} \to 1$ in this case.
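
The prescription translates directly into code. The sketch below uses exact fractions and, as an assumed test case, the gaussian variance estimator with unknown mean in units where $\mu_2 = 1$; the function names are hypothetical.

```python
from fractions import Fraction


def bias_factor(a_squared, var_unbiased):
    """f_m = a^2 / (V[A'] + a^2) = a^2 / E[A'^2]."""
    return a_squared / (var_unbiased + a_squared)


def d2(f, a_squared, var_unbiased):
    """Squared distance to a of the rescaled estimator f * A'."""
    return f**2 * var_unbiased + a_squared * (f - 1) ** 2


# Gaussian variance with unknown mean, mu_2 = 1: V[S'^2] = 2 / (n - 1).
n = 10
v = Fraction(2, n - 1)
one = Fraction(1)
f_m = bias_factor(one, v)
print(f_m)                              # (n - 1) / (n + 1), here 9/11
print(d2(f_m, one, v) == f_m * v)       # distance reduced by the factor f_m
print(d2(Fraction(19, 20), one, v) < v) # an upper bound on f_m below 1 still helps
```

The last line illustrates the remark on non trivial upper bounds: any factor between $f_m$ and 1 gives a smaller squared distance than the unbiased estimator, although not the smallest one.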

4.2 Examples 

• Estimation of the variance of a gaussian distribution with unknown expectation value.

This is the already treated example. Here $a = \mu_2$ and $E[A'^2] = V[S'^2] + \mu_2^2 = \frac{2}{n-1}\mu_2^2 + \mu_2^2$, therefore $f = \frac{n-1}{n+1}$ and $S^2_{better} = \frac{1}{n+1}\sum_i (X_i - \bar{X})^2$ as already found.

• Estimation of the variance of a gaussian distribution with known expectation value.

The unbiased estimator is here $S_0^2 = \frac{1}{n}\sum_i (X_i - m)^2$ with $V[nS_0^2/\mu_2] = 2n$ hence $V[S_0^2] = \frac{2\mu_2^2}{n}$. Therefore $f = \frac{n}{n+2}$ and $S^2_{better} = \frac{1}{n+2}\sum_i (X_i - m)^2$.



• A variation on the first example can be found in the problem of the linear least square fit with gaussian errors. When the overall scale $\sigma^2$ of the covariance matrix $V = \sigma^2 W$ of the observations is unknown, finding the parameter estimators is still possible ($W$ is assumed to be known), but not so for their covariance matrix or for the variance of a prediction. One can then estimate $\sigma^2$ using the fact that the residual quadratic form $Q_{min}$, divided by $\sigma^2$, is $\chi^2$-distributed with $n - k$ degrees of freedom, with $n$ the number of points and $k$ the number of estimated parameters (see e.g. @]). One has $Q_{min} = \epsilon^T W^{-1} \epsilon$ with $\epsilon$ the vector of residuals, and an unbiased estimator of $\sigma^2$ is therefore $\hat{\sigma}^2 = \frac{Q_{min}}{n-k}$.

According to our recipe, $\hat{\sigma}^2_{better} = \frac{Q_{min}}{n-k+2}$.



• Estimation of the variance of an arbitrary distribution with known expectation value.

With $S_0^2$ as here above for the unbiased estimator one finds: $E[(S_0^2)^2] = \frac{1}{n^2}E\left[\sum_i (X_i - m)^4 + 2\sum_{i<j}(X_i - m)^2 (X_j - m)^2\right] = \frac{1}{n}\mu_4 + \frac{n-1}{n}\mu_2^2$, with $\mu_4$ the fourth central moment. The improved estimator is therefore: $S^2_{better} = \frac{n}{n+\gamma-1}S_0^2$ with $\gamma$ the ratio $\mu_4/\mu_2^2$ ($\gamma$ equals 3 for a gaussian distribution, which checks our preceding result).

• Estimation of the variance of an arbitrary distribution with unknown expectation value.

This calls for the more tedious calculation of the second moment of $S'^2$ defined above. One finds $E[(S'^2)^2] = \frac{\mu_4}{n} + \frac{n^2-2n+3}{n(n-1)}\mu_2^2$, and the improved estimator can be written:

$$S'^2_{better} = \frac{S'^2}{1 + \frac{2}{n-1} + \frac{\gamma-3}{n}}$$

Note that by Schwarz's inequality, $\gamma \ge 1$ in accordance with our calculations for the last two items. The second result yields a marginal improvement even in the absence of a better knowledge of $\gamma$ than this trivial bound.

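• Estimation of the parameter of an exponential distribution.

The density is $\frac{1}{\tau}e^{-t/\tau}$ for $t > 0$ and the unbiased estimator of $\tau$ is $\hat{\tau} = \frac{1}{n}\sum_i T_i$, with variance $\frac{\tau^2}{n}$. One finds $\hat{\tau}_{better} = \frac{1}{n+1}\sum_i T_i$.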

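• If, in the preceding example, one prefers to estimate the rate $\lambda = \tau^{-1}$, the M.L. estimator is $\hat{\lambda} = \frac{n}{\sum_i T_i}$. The moments of $\hat{\lambda}$ are easily computed by observing that $\sum_i T_i$ is distributed according to a $\gamma(n, \lambda)$ law and by using the normalisation integral: $\Gamma(k) = \int_0^\infty \lambda^k x^{k-1} e^{-\lambda x} dx$. $\hat{\lambda}$ is biased but $\frac{n-1}{n}\hat{\lambda}$ is not and one finds that the improved estimator is $\hat{\lambda}_{better} = \frac{n-2}{n}\hat{\lambda}$.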



• For an unbiased estimator which reaches the minimum variance bound (Cramér-Rao inequality) the factor $f_m$ reads $f_m = \frac{a^2}{a^2 + 1/I_n(a)}$ where $I_n$ is the amount of information on $a$ brought by the $n$-sample, viz: $I_n(a) = E\left[\left(\frac{\partial \ln \mathcal{L}}{\partial a}\right)^2\right]$ with $\mathcal{L}$ the likelihood of the sample. The lifetime estimator above is a case of that kind.



• For a last example, let us consider the maximum likelihood estimator of the parameter $\theta$ of a uniform distribution on $[0, \theta]$. This is $\hat{\theta} = \sup_i X_i$ where $\{X_i\}$ stands for the sample. This estimator is easily shown to be biased: $E[\hat{\theta}] = \frac{n}{n+1}\theta$. The unbiased estimator is therefore $\frac{n+1}{n}\sup_i X_i$ from which one easily finds the improved $\hat{\theta}_{better} = \frac{n+2}{n+1}\sup_i X_i$.
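
Two of the examples above, the uniform end point and the exponential lifetime, lend themselves to a quick Monte Carlo check. The sketch below is illustrative only: sample size, number of trials, seeds and function names are assumptions chosen for the example.

```python
import random


def mse(estimates, truth):
    """Mean square distance of a list of estimates to the true value."""
    return sum((e - truth) ** 2 for e in estimates) / len(estimates)


def compare_uniform(n=5, theta=1.0, trials=40000, seed=2):
    """Mean square distances of c * max(X_i) for three choices of c."""
    rng = random.Random(seed)
    maxima = [max(rng.uniform(0.0, theta) for _ in range(n))
              for _ in range(trials)]
    factors = {"ml": 1.0,
               "unbiased": (n + 1) / n,
               "better": (n + 2) / (n + 1)}
    return {name: mse([c * m for m in maxima], theta)
            for name, c in factors.items()}


def compare_exponential(n=5, tau=1.0, trials=40000, seed=3):
    """Mean square distances of sum(T_i) / n and sum(T_i) / (n + 1)."""
    rng = random.Random(seed)
    sums = [sum(rng.expovariate(1.0 / tau) for _ in range(n))
            for _ in range(trials)]
    return {"unbiased": mse([s / n for s in sums], tau),
            "better": mse([s / (n + 1) for s in sums], tau)}


print(compare_uniform())
print(compare_exponential())
```

In the uniform case the improved estimator lies above the maximum likelihood one, which shows that the prescription does not always shrink the estimate that one starts from in practice; it only rescales the unbiased estimator by a factor smaller than 1.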



5 Summary and conclusion 

It has been argued that the requirement of unbiasedness at the price of a larger mean square distance to the estimated parameter is not well grounded. Mean absolute differences or mean squared differences are clearly more meaningful than the average of signed differences which can hide large fluctuations through compensation. It has been observed that the demand of a 'centered' histogram is a matter of habit, but has no meaning as to the optimality of the estimates for other purposes than checking calculations. Requiring histograms to be 'centered' in reference to the median would not be less legitimate.

However, with the help of a definite example, it has been shown that attempts to use some kind of phantasmatic 'error compensation' through the use of mean, median or mode lead to contradictions and to the use of estimators with ever wider distributions.
On the other hand, minimizing the mean square distance gives a general prescription to improve on an unbiased estimator by biasing it to a slightly lower expectation value. Even if the formula for the bias factor thus obtained is not always applicable because of the unknown quantities that it involves, it has been shown that it yields perfectly definite and usable results in some important cases. Any non trivial upper bound on this bias factor yields some improvement.



In conclusion, unbiased estimators are certainly useful for constructing control histograms, but should not be automatically taken at face value when the problem is that of using the estimates for further calculations.